1
00:00:00,734 --> 00:00:02,767
[Music]

2
00:00:03,790 --> 00:00:04,840
[Announcer] This is a Public Resource

3
00:00:07,834 --> 00:00:10,599
[Carl Malamud] Hello, I'm Carl Malamud.

4
00:00:10,599 --> 00:00:12,960
Welcome to the General Index.

5
00:00:12,960 --> 00:00:20,050
We present here, for free and unrestricted use, n-grams extracted from 107 million,

6
00:00:20,050 --> 00:00:24,790
233 thousand, 728 journal articles.

7
00:00:24,790 --> 00:00:29,970
The n-grams range from single words, known as unigrams, to two-word phrases (bigrams),

8
00:00:29,970 --> 00:00:33,840
up to five-word phrases, known as 5-grams.

9
00:00:33,840 --> 00:00:40,440
There are, all told, 355 billion rows of n-grams in this release of the General Index, each one

10
00:00:40,440 --> 00:00:43,910
corresponding to a hit in a journal article.

11
00:00:43,910 --> 00:00:50,579
A toolkit called spaCy was used to extract the n-grams, which we have divided into

12
00:00:50,579 --> 00:00:51,700
sixteen files.

13
00:00:51,700 --> 00:00:56,820
Associated with each row is an md5 hash, which represents the journal article.

14
00:00:56,820 --> 00:01:00,350
In addition, we list the number of times the n-gram occurs in the article.

15
00:01:00,350 --> 00:01:05,850
We hope over time to also list the number of times it occurs in the overall corpus, an important metric

16
00:01:05,850 --> 00:01:08,600
known as TF-IDF.

17
00:01:08,600 --> 00:01:14,450
In addition to n-grams, the General Index contains keywords extracted using Yake.

18
00:01:14,450 --> 00:01:20,320
Yake "rests on text statistical features extracted from single documents to

19
00:01:20,320 --> 00:01:24,190
select the most important keywords of a text."

20
00:01:24,190 --> 00:01:29,890
The keywords table in the General Index has a total of 19 billion keywords.

21
00:01:29,890 --> 00:01:36,530
Finally, the General Index contains a metadata table, mapping the md5 hash to things such as

22
00:01:36,530 --> 00:01:42,130
the unique DOI of the journal article, the title, the author, and the journal.

23
00:01:42,130 --> 00:01:47,000
This is the first release of the General Index, a work in progress.

24
00:01:47,000 --> 00:01:49,900
In some cases, text extraction failed.

25
00:01:49,900 --> 00:01:54,450
In some cases, metadata is unavailable, or is present but possibly erroneous.

26
00:01:54,450 --> 00:01:59,830
While the underlying corpus is large, it is by no means complete, nor is it up to date.

27
00:01:59,830 --> 00:02:04,430
Many methods could be improved, and we look forward to making the General

28
00:02:04,430 --> 00:02:07,369
Index better over time.

29
00:02:07,369 --> 00:02:13,260
The General Index takes up about 38 terabytes of data in the form of a Postgres

30
00:02:13,260 --> 00:02:20,010
dump; with compression, however, this ASCII data is reduced to 8.5 terabytes.

31
00:02:20,010 --> 00:02:25,139
You can download the data directly from here, or use tools such as BitTorrent.

32
00:02:25,139 --> 00:02:31,540
Feel free to make mirrors or to place it in other public repositories for general distribution.

33
00:02:31,540 --> 00:02:37,689
Our hope is that it can be put to a number of uses. We have been using a Postgres database,

34
00:02:37,689 --> 00:02:42,919
and using the General Index to look up plant names from a list of terms.

35
00:02:42,919 --> 00:02:48,219
The index can be used to look for other things, such as chemicals, different genetic

36
00:02:48,219 --> 00:02:54,300
encodings, proteins, materials, place names, or other entities.

37
00:02:54,300 --> 00:03:00,129
Instead of a Postgres database, other techniques, such as BERT, can be applied to the data.

38
00:03:00,129 --> 00:03:05,909
This is a search tool, a dictionary of knowledge, a map to knowledge, a tool we believe

39
00:03:05,909 --> 00:03:10,500
is an essential facility for the practice of science in our modern times.

40
00:03:10,500 --> 00:03:16,919
This General Index to knowledge does not make the corpus available; rather, it is facts

41
00:03:16,919 --> 00:03:19,099
extracted from the corpus.

42
00:03:19,099 --> 00:03:23,599
This is a non-consumptive and transformative use.

43
00:03:23,599 --> 00:03:28,569
We consider this a public utility; the General Index has no owners.

44
00:03:28,569 --> 00:03:35,599
It is dedicated to the public domain, a collection of innumerable facts with which you can

45
00:03:35,599 --> 00:03:37,019
do as you wish.

46
00:03:37,019 --> 00:03:40,220
No rights reserved.

47
00:03:40,220 --> 00:03:46,510
Scholars on the path to discovery may consult this roadmap to find the most expedient places

48
00:03:46,510 --> 00:03:52,540
they wish to visit. We are merely guides on the trails to the world's great temples,

49
00:03:52,540 --> 00:03:58,769
to the global community of scholars and scientists and engineers and artists who have built

50
00:03:58,769 --> 00:04:01,370
these temples of knowledge.

51
00:04:01,370 --> 00:04:08,219
These temples, from the Library of Alexandria to the library of Alexandra, from the monks

52
00:04:08,219 --> 00:04:14,239
of Ireland to the many modern research libraries of today, from the Arab and Jewish scribes

53
00:04:14,239 --> 00:04:19,910
working together to build the House of Wisdom in Baghdad to the knowledge factories built

54
00:04:19,910 --> 00:04:25,071
in the modern-day East India trading houses such as the Oxford and Cambridge presses, these

55
00:04:25,071 --> 00:04:30,400
are places scholars must visit to advance their craft, to learn from

56
00:04:30,400 --> 00:04:31,950
those who came before us.

57
00:04:31,950 --> 00:04:37,190
These temples hold the sum of all knowledge, but we must know how to find that knowledge.

58
00:04:37,190 --> 00:04:39,500
The General Index is a map.

59
00:04:39,500 --> 00:04:44,630
This is not a map to the sum of all human knowledge,

60
00:04:44,630 --> 00:04:46,410
but it is indeed a significant portion.

61
00:04:46,410 --> 00:04:51,960
For the increase and diffusion of knowledge to continue, if we are to stand on the shoulders

62
00:04:51,960 --> 00:04:56,520
of giants, we must provide these maps to that great world of ideas.

63
00:04:56,520 --> 00:05:02,670
The General Index is but one tool, a tool that we hope will over time grow in quality

64
00:05:02,670 --> 00:05:08,250
and breadth, a tool that we hope will make a useful contribution

65
00:05:08,250 --> 00:05:09,460
to your practice of science.

66
00:05:09,460 --> 00:05:14,121
We hope that you too will put this index to amazing use, that we can make the index

67
00:05:14,121 --> 00:05:19,670
better, that you can make other tools, that the community of scholars can work together

68
00:05:19,670 --> 00:05:24,910
toward the common goal of plumbing the great depths of this sea of learning.

69
00:05:24,910 --> 00:05:30,160
I believe we all share a common goal, a common hope.

70
00:05:30,160 --> 00:05:35,660
As scientists, scholars, artists, writers, and as citizens of the world, as children wishing

71
00:05:35,660 --> 00:05:41,770
to learn, as curious people, I am sure we all hope that the increase and diffusion

72
00:05:41,770 --> 00:05:47,100
of knowledge will make the world a better place, that it will help us understand our

73
00:05:47,100 --> 00:05:53,110
world, to cure disease and poverty, to advance the progress of science and commerce and

74
00:05:53,110 --> 00:05:55,000
the humanities and art.

75
00:05:55,000 --> 00:06:01,140
Science is a language we must all speak if we are to make our world better.

76
00:06:01,140 --> 00:06:03,620
Thank you for listening.

77
00:06:03,620 --> 00:06:11,890
[Fur Seal]

78
00:06:11,890 --> 00:06:13,950
[Music]
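
The row layout described in the talk (an n-gram, the number of times it occurs in an article, and an md5 hash standing for that article) can be illustrated with a small sketch. This is not the project's pipeline: the regex tokenizer is a stand-in for spaCy, hashing the article text is an assumption (the talk does not say what the md5 is computed over), and the function names `ngrams`, `index_rows`, and `tf_idf` are hypothetical. The last function shows one common way corpus-wide counts could feed a TF-IDF score, using the plain count times log(N / document frequency) variant.

```python
import hashlib
import math
import re
from collections import Counter


def ngrams(text, max_n=5):
    """Count every 1- to max_n-word phrase in text (toy tokenizer, not spaCy)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def index_rows(articles):
    """Yield (md5_hex, ngram, count) rows, one per distinct n-gram per article.

    Assumption: the hash is taken over the article text itself.
    """
    for text in articles:
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        for gram, count in ngrams(text).items():
            yield key, gram, count


def tf_idf(rows):
    """Score each (article, n-gram) pair as count * log(N / document frequency)."""
    rows = list(rows)
    docs = {key for key, _, _ in rows}
    # Document frequency: number of distinct articles containing each n-gram.
    doc_freq = Counter(gram for _, gram in {(key, gram) for key, gram, _ in rows})
    return {
        (key, gram): count * math.log(len(docs) / doc_freq[gram])
        for key, gram, count in rows
    }


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog"]
    scores = tf_idf(index_rows(sample))
    for (key, gram), score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
        print(key[:8], gram, round(score, 3))
```

An n-gram that appears in every article scores zero under this variant, which is why a corpus-wide count is useful next to the per-article count already in each row.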